Abstract


We present a method that automatically evaluates emotional response from spontaneous facial 
activity recorded by a depth camera. The automatic evaluation of emotional response, or affect, is 
a fascinating challenge with many applications, including human-computer interaction, media 
tagging and human affect prediction. Our approach in addressing this problem is based on 
the inferred activity of facial muscles over time, as captured by a depth camera recording an 
individual’s facial activity. Our contribution is two-fold: First, we constructed a database of 
publicly available short video clips, which elicit a strong emotional response in a consistent 
manner across different individuals. Each video was tagged by its characteristic emotional 
response along 4 scales: Valence, Arousal, Likability and Rewatch (the desire to watch again). 
The second contribution is a two-step prediction method, based on learning, which was trained 
and tested using this database of tagged video clips. Our method was able to successfully predict 
the aforementioned 4 dimensional representation of affect, as well as to identify the period of 
strongest emotional response in the viewing recordings, in a method that is blind to the video clip 
being watched, revealing a significantly high agreement between the recordings of independent
viewers. 
Introduction


The Roman philosopher Cicero wrote, "The face is a picture of the mind". This statement
has been discussed and debated repeatedly over the years, within the scientific community
and outside it: is the face really a window to the soul? Does human emotion reflect
in facial expressions? There is broad agreement among scholars that the answer is, at least
partially, yes. Hence, we asked the following question: could it be done automatically? That is,
could computer vision tools be utilized to evaluate humans’ emotional state, given their facial
expressions?

In the past two decades we have witnessed an increased interest in automatic methods that
extract and analyze human emotions, or affective state. The potential applications of automatic
affect recognition vary from human computer interaction to emotional media tagging, including 
for example the creation of a user’s profile in various platforms, building emotion-driven HCI 
systems, and emotion-based tagging of dating sites, videos on YouTube or posts on Facebook. 
Indeed, in recent years media tagging has received much attention in the research community 
(e.g. [83, 92]). 

In this work we took advantage of the emerging technology of depth cameras. Recently, depth 
cameras based on structured light technology have emerged as a means to achieve effective 
human computer interaction in gaming, based on both gestures and facial expressions [1]. We 
used a depth camera (Carmine 1.09) to record participants’ facial responses to a set of video clips
designed to elicit emotional response, and developed two types of pertinent prediction models
for automatic quantitative evaluation of affect: models for tagging video clips from human facial
expressions (i.e. implicit media tagging), and models for predicting viewers’ affective state, given
their facial behavior (i.e. affect prediction). A clear separation between the two should
be drawn.

Both implicit media tagging and affect prediction concern the estimation of emotion-related 
indicators based on non-verbal cues, but they differ in their target: in the first, the purpose is
predicting attributes of the multimedia stimuli, while in the latter, the human affect is the matter
at hand. This distinction can be made clear by observing the triangular relationship between
the video clip’s affective tagging, the facial response to it and the viewer’s reported emotional
feedback (see Figure 1.1): implicit media tagging concerns the automated annotation of a stimulus
directly from spontaneous human response, while affect prediction deals with predicting the
viewer’s affective state. Note that the identification of objects and locations (e.g. [88]) is outside
the scope of this work, which deals only with emotion-related tags.

Figure 1.1: The triangular relationship between the facial expression, the media tags and the viewer’s affective state.


Implicit Media Tagging



As opposed to explicit tagging, in which the user is actively involved in the tagging process, 
implicit tagging is done passively, and relies only on the typical interaction the user has with
such stimuli (e.g. watching a video clip). As such, it is less time and energy consuming, and more 
likely to be free of biases. It has been suggested that explicit tagging tends to be rather inaccurate 
in practice; for example, users tend to tag videos according to their social needs, which yields 
tagging that could be reputation-driven, especially in a setup where the user’s friends, colleagues 
or family may be exposed to their tags [77]. 

In recent years, media tagging has become an integral part of surfing the internet. Many web
platforms allow (and even encourage) users to label their content by using keywords (e.g. funny,
wow, LOL) or designated scales (e.g. Facebook’s reactions, see Figure 1.2). Clearly, non-invasive
methods which produce such tags implicitly can be of great interest. That being said, implicit
media tagging is complementary (rather than contradictory) to explicit media tagging, and as such can
be used for assessing the correctness of explicit tags [45], or for examining the inconsistency
between the intentional (explicit) and involuntary (implicit) tagging [63, 64].



Figure 1.2: Facebook’s reactions. 
Altogether, we learned four different models that share a similar inner mechanism, but vary
in their prior knowledge and target. Two models were trained to predict the clip’s
affective rating (implicit media tagging), while two other models were trained to predict the
viewer’s subjective affective state for each individual (affect prediction). Specifically, the first model
predicts the affective rating of a new clip given the facial expressions of a single viewer to a set of 
known clips, while the second model uses the facial expressions of several viewers. This can be 
useful for the affective tagging of new video clips by a system relying on the facial expressions of 
a group of viewers, who had previously been recorded viewing a set of clips with pre-determined 
affective rating. If no pre-determined affective ratings are available, the third model predicts 
the subjective rating of a new clip, when given the facial expressions and subjective ratings of a 
single viewer to a set of clips which does not include the new clip. 

Affect computation from facial expressions cannot be readily separated from the underlying 
theories of emotional modeling, which rely on assumptions made by researchers in the field of 
psychology and sociology, and in some cases are still under debate. Thus the Facial Action Coding 
System (FACS), originally developed by the Swedish anatomist Carl-Herman Hjortsjö [40], is
often used to analyze facial activity in a quantitative manner. FACS gives a score to the activity 
of individual facial muscles called Action Units (AUs), based on their intensity level and temporal 
segments. In our work we used the commercial software Faceshift [5] to extract quantitative 
measures of over 50 AUs from recordings of spontaneous facial activity. 

Our goal was to achieve an automatic model that infers descriptive ratings of emotion. To this 
end we employed the dimensional approach to modeling emotion, a framework whose elements 
are bipolar axes constituting the basis of a vector space (not necessarily orthonormal), and where 
it is assumed that every emotion can be described as a linear combination of these axes. We 
represented emotion by a combination of two key scales - Valence and Arousal (see discussion in 
Section 2.1). In addition, we added 2 contemporary scales - Likability and Rewatch (the desire to 
watch again), which are more suitable for modern uses in HCI and media tagging (following [69]). 
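
To make the four-scale representation concrete, the following minimal Python sketch stores one rating as a point on the VALR scales. The names `VALR`, `SCALE_RANGES` and `as_vector` are illustrative assumptions and not part of our implementation; the scale ranges are the ones listed in Table 3.1.

```python
from dataclasses import dataclass

# Rating ranges as listed in the database summary (Table 3.1).
SCALE_RANGES = {
    "valence": (1, 5),
    "arousal": (1, 5),
    "likability": (1, 3),
    "rewatch": (1, 3),
}


@dataclass(frozen=True)
class VALR:
    """Hypothetical container for one rating on the four scales."""
    valence: float
    arousal: float
    likability: float
    rewatch: float

    def __post_init__(self):
        # Reject values that fall outside the range of their scale.
        for name, (low, high) in SCALE_RANGES.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")

    def as_vector(self):
        """Return the rating as a point in the 4-dimensional VALR space."""
        return (self.valence, self.arousal, self.likability, self.rewatch)
```

Averaged ratings are generally not integers, which is why the fields are typed as floats rather than restricted to the discrete values of each scale.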

Our method is based on learning from a tagged database of video clips. In order to train a 
successful algorithm and be able to test it against some meaningful ground truth, we needed 
a database of video clips which can invoke strong emotional responses in a consistent manner 
across individuals. Since no reliable database of video clips that suits our needs exists at the
moment, we constructed a new database from publicly available video clips (see discussion in 
Chapter 3). This database is available to the community and can be inspected in [2]. A similar 
empirical procedure was used to collect data for the training and evaluation of our method, as 
described in Chapter 4. 

Next, we developed a vector representation for videos of facial expressions, based on the 
estimated FACS measurements. Specifically, we started from AUs computed automatically by 
Faceshift [5] from a depth video. The procedure by which we obtained a concise representation
for each video of facial expressions, which involved quantification of both dynamic and
spatial features of the AU activity, is described in Section 4.3.
In Section 4.4 we describe the method by which we learned to predict affect from the ensuing
representation for each viewer and each viewed clip. In the first step of our method we generated
predictions for small segments of the original facial activity video, employing linear regression to
predict the 4 quantitative scales which describe affect (namely valence, arousal, likability and
rewatch, a.k.a. VALR). In the second step we generated a single prediction for each facial expressions
video, based on the values predicted in the first step for the set of segments encompassed
in the active part of the video. The results are described in Chapter 5, showing high correlation
between the predicted scores and the actual ones.
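
The two-step scheme can be sketched as follows. This is a minimal illustration and not the implementation described in Section 4.4: the function names, the use of an ordinary least-squares fit with a bias term, and in particular the choice of the mean as the rule that combines segment predictions in the second step are all assumptions made for the example.

```python
import numpy as np


def _with_bias(features):
    """Append a constant column so that the regression has an intercept."""
    features = np.asarray(features, dtype=float)
    return np.hstack([features, np.ones((len(features), 1))])


def fit_segment_regressor(segment_features, segment_targets):
    """Step 1 (training): least-squares map from segment features to VALR."""
    weights, *_ = np.linalg.lstsq(
        _with_bias(segment_features), segment_targets, rcond=None
    )
    return weights  # shape: (n_features + 1, 4)


def predict_segments(weights, segment_features):
    """Step 1 (prediction): one VALR estimate per segment."""
    return _with_bias(segment_features) @ weights


def predict_video(weights, segment_features, active_mask):
    """Step 2: one VALR estimate per video from its active segments.

    ASSUMPTION: the segment predictions are combined by a plain mean.
    """
    per_segment = predict_segments(weights, segment_features)
    return per_segment[np.asarray(active_mask, dtype=bool)].mean(axis=0)
```

Here `active_mask` is a hypothetical boolean vector marking the segments that belong to the active part of the recording; how that part is found is a separate question.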

There are three main novelties in this work: first, our models are based directly on the
automatically inferred activity of facial muscles over time, considering dozens of muscles, in a
method which is blind to the actual video being watched by the subject (i.e. the model isn’t aware
of any of the stimuli details and attributes); second, the facial muscular activity is measured
using only a single depth camera; and third, we present a new publicly available database of
emotion eliciting short video clips, which elicit a strong affective response in a consistent manner 
across different individuals. 
Theoretical Background and Previous Work


Human facial expressions are a fundamental aspect of our everyday lives, and have been a
popular research field for many decades. It is not always thought about or spoken of,
but many of the decisions we make are based upon other people’s facial expressions, for
example, deciding whether to ask for someone’s phone number ("is he smiling back?"), or whether
a person is to be trusted. Psychologists have persistently claimed that human beings are
in fact experts in facial expressions (as a car enthusiast would be an expert in car models), in
terms of being able to distinguish between highly similar expressions, recognize a large number of
different faces (of friends, family, co-workers, etc.) and be sensitive to subtle facial gestures.


2.1 Quantification of Emotion and Facial Expressions 

For over 100 years now, scholars have been interested in finding a rigorous model for the
classification of emotion, seeking to express and explain every emotion by a minimal set of 
elements. The work in this field can be divided into two approaches: the categorical approach 
and the dimensional approach (see [36] for a comprehensive survey). These approaches are 
not contradictory, as they basically offer alternative ways to describe the same phenomenon. 
Briefly, the categorical approach postulates the existence of several basic and universal emotions 
(typically surprise, anger, disgust, fear, interest, sadness and happiness), while the dimensional 
approach assumes that each emotion can be described by a point in a 2D/3D coordinate system 
(typically valence, arousal and dominance). 

In this work we collected emotional ratings following the dimensional approach, alongside free
language descriptions, for several reasons: first, using a forced-choice paradigm for emotion rating
(compelling subjects to choose an emotion from a closed set of basic emotions) may yield spurious
and tendentious results that do not fully capture the experienced emotion [37]; second, keeping
in mind that mixed feelings can arise from a single stimulus [39, 57], it follows that restricting
participants to using a set of discrete emotions might bring about the loss of some emotional 
depth; and third, basic emotion tags could be extracted from valence and arousal ratings, as well 
as from the free language text [7, 23]. 

In the 1970s Ekman and Friesen introduced FACS, which was developed to measure and
describe facial movements [27]. The system’s basic components are Action Units (AUs), where
most AUs are associated with some facial muscle, and some describe more general gestures, such
as sniffing or swallowing (see examples in Fig. 2.1). Every facial expression can be described
by the set of AUs that compose it. Over the years, Ekman and his colleagues expanded the set
of AUs (and the expressiveness of FACS accordingly), adding head movements, eye movements and
gross behavior. Currently there are several dozen AUs, of which over 50 are solely face-related.
Describing a facial expression in terms of AUs is traditionally done by FACS-experts specifically 
trained for this purpose. 

Automated FACS coding that will replace the manual one poses a major challenge to the field 
of computer vision [74]. AUs extraction can be done using methods based on geometric features, 
such as tracking points or shapes on the face, with features like position, speed and acceleration 
(e.g., [31, 75, 76]), or using appearance based methods based on changes in texture and motion 
of the skin, including wrinkles and furrows (e.g., [8, 14, 15, 61]). The use of geometric features 
tends to yield better results, since appearance based methods are more sensitive to illumination 
conditions and to individual differences, though a combination of both methods may be preferable 
[96]. A newer promising method is based on temporal information in AU activity, which was 
found to improve recognition as compared to static methods [53, 65]. Basic features are classified 
into AUs using model driven methods such as active appearance models (AAMs) [62] or data 
driven methods [85]. While data driven methods require larger sets of data in order to cope with 
variations in pose, lighting and textures, they allow for a more accurate and person-independent
analysis of AUs [86]. 
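
As an illustration of the geometric features mentioned above (position, speed and acceleration of tracked points), the following sketch derives speed and acceleration magnitudes from a sequence of tracked 2D point positions using finite differences. It is a generic example under our own assumptions (the function name, the array layout and a constant frame rate), and does not reproduce any of the cited methods.

```python
import numpy as np


def kinematic_features(positions, fps):
    """Speed and acceleration magnitude of tracked facial points.

    positions: array of shape (n_frames, n_points, 2), in pixels.
    fps: frame rate of the recording, assumed constant.
    Returns two arrays of shape (n_frames, n_points).
    """
    positions = np.asarray(positions, dtype=float)
    dt = 1.0 / fps
    # Finite-difference derivatives along the time axis.
    velocity = np.gradient(positions, dt, axis=0)
    acceleration = np.gradient(velocity, dt, axis=0)
    speed = np.linalg.norm(velocity, axis=-1)
    acceleration_magnitude = np.linalg.norm(acceleration, axis=-1)
    return speed, acceleration_magnitude
```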



Figure 2.1: Examples from the Facial Action Coding System [28]. 
2.2 Implicit Media Tagging and Affect Prediction

To our knowledge, implicit media tagging and the prediction of viewers’ affective state based on
data obtained from depth cameras is a recent and largely uncharted territory. A few exceptions include
Niese et al. [72], who used 3D information estimated from the raw data obtained by a 2D camera,
and Tron et al. [99, 100], who used depth cameras to classify the mental state of schizophrenia
patients, based on their facial behavior.

The models in affect prediction are designed to predict each viewer’s personal affective state,
while in media tagging they attempt to predict a media tag based on input from (possibly different)
viewers. Formally, for a person $p_i$ who is undergoing the affective state $a_i$ while viewing the clip
$c_k$, an affect prediction model would be defined as $f(p_i) = a_i$, while an implicit media tagging
model would be $g(p_i) = c_k$. In particular, for viewers $i, j$ it holds that $f(p_i) = a_i$ and $f(p_j) = a_j$
(where $a_i$ does not necessarily equal $a_j$), but $g(p_i) = g(p_j) = c_k$. See illustration in Figure 2.2.




Figure 2.2: Illustration of the difference between affect prediction (f) and implicit media tagging (g).


2.2.1 Implicit Media Tagging 

A number of implicit media tagging theories and models have been developed based on different 
modalities, including facial expressions, low-level visual features of the face and body, EEG, eye 
gaze, fMRI, or audio-visual features of the stimuli. 
Facial Expressions

Hussein and Elsayed [41] trained a classifier to predict the relevance of a document to a person 
based on her facial behavior, using 16 feature points along the face. Arapakis et al. [9] pursued
a similar goal, but the model’s features included the level of basic emotions (e.g., how happy or 
angry the participant is), combined with other peripheral physiological metrics (such as skin 
temperature). They achieved higher success rates (accuracy of 66.5% at best) and showed that a 
user’s predicted emotional states can be used to predict the stimuli’s relevance. 

Jiao and Pantic [45] suggested a facial expressions based model to predict the correctness of 
tags of images. They used 2D cameras to identify 19 facial markers (such as eyebrows boundaries, 
lip corners and nostrils) and used Hidden Markov Models to classify their movement along time. 
Their conclusion was that users’ facial reactions convey information about the correctness of
tags associated with multimedia data. Similarly, Tkalcic et al. [97] presented a method to predict 
valence, arousal and dominance tags for images, based on the user’s facial expressions (given as 
a sequence of images). 

As for video tagging, most works focus on affect prediction. Among the few that discuss media 
tagging are Zhao et al. [113], who proposed a method based only on the participant’s facial
expressions to predict categories of movies (e.g. comedy, drama and horror). Almost all papers 
that use facial expressions for implicit media tagging combine it with additional input, such as 
content from the video itself or additional physiological measures. Bao et al. [12] added the user’s 
acoustic features, motion and interaction with the watching device (tablet or mobile). Wang et 
al. [103] used head motion as well as facial expressions, combined with "common emotions" (the 
emotions that are likely to be elicited from the majority of users) to predict a video’s tags, in terms
of basic emotions. Another common measurement uses the EEG signal (e.g. [54, 90]). 


Other Methods 

The use of EEG in this field is relatively common, presumably because it is non-invasive and 
relatively cheap, permitting the appealing notion that one may be able to tap specific brain 
localization for different emotional states. EEG is popular for tagging of objects and landscapes 
(e.g. [30, 49, 51]), but it is also commonly used for emotion-related tagging. 

Yazdani et al. [110] implemented an EEG-based system that predicts a media’s tag, in terms 
of basic emotions [26]; they achieved an average of 80.19% over a self-developed database of 24
short video clips. Soleymani et al. used EEG-based methods to tag clips from the MAHNOB-HCI
database [91], once combined with pupillary response and eye gaze [94] and once without [93],
and reached a success rate of F1 = 0.56 for valence and F1 = 0.64 for arousal. Wang et al. [104]
expanded the battery of tools in use, adding features from the video stimuli itself to the EEG 
recordings, to create a fusion method that achieved an accuracy score of 76.1% over valence and 
73.1% over arousal. 
Han et al. [38] proposed a method that predicts the arousal rating of video clips based on
fMRI scans and low-level audio-visual features; this method achieved an average accuracy score
of 93.2% for 3 subjects. Jiang et al. [43] proposed a framework that interweaves fMRI results with
low-level acoustic features and thus enables audio tagging based on fMRI without actually using
fMRI further on. Abadi et al. [6] combined MEG signals and peripheral physiological signals (such
as EOG and ECG) to predict the ratings of movies and music clips, including valence, arousal
and dominance ratings, with accuracy scores of 0.63, 0.62 and 0.59 (respectively) at best. Similar
to [12], in [89] the authors also proposed a method to estimate movies’ ratings, based instead on
Galvanic Skin Response (GSR).


2.2.2 Affect prediction 

Similarly to implicit media tagging, affect prediction models are usually composed of several 
inputs: some physiological (e.g. facial expressions, brain activity or acoustic signals) and some
derived from the media the participant is exposed to (e.g. the video clip’s tagging or visual and
acoustic features). Moreover, they typically rely on the participant’s reported emotional state. 

Lin et al. [59] combined EEG and acoustic characteristics from musical content to evaluate 
the participant’s reported valence and arousal. Similarly, Chen et al. [22] used music pieces as
stimuli and combined EEG with the subject’s gender data to predict valence and arousal. Zhu
et al. [114] used video clips as stimuli, and also combined EEG measurements with
acoustic/visual features from the clips to predict valence and arousal. Lee et al. [58] used a
similar method, but utilized 3D fuzzy GIST for the feature extraction and 3D fuzzy tensor for
the EEG feature extraction, and predicted valence only. The last two predicted only a binary
result (positive or negative). 

As for facial expressions, Soleymani et al. [90] compared them to EEG, and found that the results
from facial expressions are superior to the results from EEG as a predictor for affective state,
as most of the emotionally valuable content in EEG features is a result of facial muscle activity
(but EEG signals still carry complementary information). McDuff et al. [69] used a narrow set of
facial features (namely smiles and their dynamics) to predict participants’ likability and desire to
watch again of 3 Super Bowl ads, with crowdsourced data collected from the participants’ webcams.
Bargal et al. [13] were among the first to use deep neural networks on facial expressions (captured
by a 2D camera) to predict affective states (in terms of basic emotions), and among the few to
propose a model based only on facial expressions. In addition, several papers suggest methods to
predict participants’ level of interest based on their facial expressions, such as Peng et al. [80], who
implemented a model based on head motion, saccade, eye blinks, and 9 raw facial components; 
they predicted an emotional score (as well as an attention score) and combined them to predict 
the level of interest. Arapakis et al. [10] proposed a multimodal recommendation system, and 
showed that the performance of user-profiling was enhanced by facial expression data. 
2.3 Depth Cameras

The depth camera used in our study employs IR technology, in which infra-red patterns are 
projected onto the 3D scene, and depth is computed from the deformations created by 3D surfaces 
[35]. The technology enables the capture of facial surface data, which is less sensitive to head pose
and to lighting conditions than 2D data, and yields better recognition of AUs [82, 100]. One
drawback, however, is that the depth resolution is rather low, and therefore the image may 
contain small artifacts in highly reflective and non-reflective regions, or holes in regions not 
covered by the projector [84]. 


2.4 Facial Response Highlight Period 

One substantial component of our models is identifying the most informative time frame in the 
participant’s facial response. In fact, during most of the watching time, most participants’ facial 
behavior showed little to no emotional response; therefore, we sought to find the part of the clip 
where relevant and informative activity was taking place. Money and Agius [71] distinguished 
between two types of highlight period localization techniques: internal and external; the first
utilizes information from the video itself, such as image analysis, object identification and audio
features, while the second analyzes information which can be obtained irrespective of the video,
such as the viewer’s physiological measures or contextual information, the time of the day, the device,
or the location in which the viewer is watching the video. 

Both internal and external techniques have been exploited to localize a video’s highlight 
periods. Examples of internal localization are the work of Chan and Jones [21], who extracted
affective labels (valence and arousal) from the video’s audio content, and the work of Xu et al.
[108], who proposed a framework based on the combination of low-level audiovisual features with
higher-level features, such as dialogue analysis for informative keywords and emotional
intensity. 

Our highlight period localization technique is an external one that is blind to the video
clip and is based solely on the viewers’ facial expressions (see Section 4.3 for details). Examples
of external techniques are the works of Joho et al. [47, 48] and Chakraborty et al. [20]. Joho
et al. localized highlight periods from viewers’ facial expressions, based on two feature types:
expression change rate, and pronunciation level, i.e. the relative presence of expressions in each of
three emotional categories: neutral, low (anger, disgust, fear, sadness) and high (happiness and
surprise). Chakraborty et al. harnessed highlight period localization, with a model composed of
viewers’ facial expressions and heart rate, in order to detect sports highlights.
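
To illustrate what an external localization technique can look like, the sketch below picks the fixed-length window of a recording in which the AU intensities change the most. This is a generic example and not the procedure of Section 4.3: the function name, the activity measure (summed absolute frame-to-frame change) and the fixed window length are all assumptions made for illustration.

```python
import numpy as np


def locate_highlight(au_intensities, window):
    """Find the window with the largest total change in AU intensity.

    au_intensities: array of shape (n_frames, n_aus).
    window: window length, counted in frame-to-frame transitions.
    Returns (start, end): the window covers the transitions from frame
    `start` up to frame `end`.
    """
    au = np.asarray(au_intensities, dtype=float)
    # Activity of a transition: total absolute change over all AUs.
    activity = np.abs(np.diff(au, axis=0)).sum(axis=1)
    if window > len(activity):
        raise ValueError("window is longer than the recording")
    # Sliding-window sums computed from a cumulative sum.
    csum = np.concatenate([[0.0], np.cumsum(activity)])
    window_sums = csum[window:] - csum[:-window]
    start = int(np.argmax(window_sums))
    return start, start + window
```

Because only the viewer's recording is used, such a procedure stays blind to the clip itself, which is the defining property of the external techniques discussed above.
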
This chapter describes the database of emotion eliciting short video clips we have developed.
There are currently several available databases of this kind, but unfortunately none of
them is suited to our needs. In Section 3.1 we review these databases and the relevant
literature, and subsequently we list the criteria that our database must meet (Section 3.2). The
full database can be found online [2], and is also reviewed in Section 3.3. The method we employed to
collect ratings for the clips is very similar to the method we used for developing our model, and it
is described in Section 3.4 as well as in Chapter 4. The results and their analysis are described in
Section 3.5, showing high inter-rater agreement.

3.1 Eliciting Emotion via Video Clips 

The traditional stimuli used to elicit emotion in human participants under passive conditions 
(without active participation by such means as speech or movement) are sound (e.g., [112]), still 
pictures (e.g., [25, 70]) and video clips (e.g., [16]). Although still pictures are widely used, there 
are several limitations to this method. First, the stimulus is static, thus it lacks the possibility
of examining the dynamics of an evoked emotion [111]. Second, IAPS [56] is by far the most 
dominant database in use, and therefore almost every emotion-related experiment uses it. Video 
clips are also not free of issues, but it seems that they are the most suitable for eliciting strong 
spontaneous authentic emotional response [29, 73, 106]. Thus, for our needs, an emotion eliciting 
database of video clips is required. 

The pioneers who first developed and released affect-tagged video clip databases were [81]
and [33]. Both used excerpts from well-known films such as Kramer vs. Kramer and Psycho, 
and collected ratings from human subjects about their experienced emotion. The considerable 
advances in emotion recognition in recent years have motivated scholars to develop and release
large batteries of emotion eliciting video clip databases, and we review them in Appendix A.

Initially we defined a set of criteria that our database must meet. However, each of the 
aforementioned databases lacks at least one of them (as can be seen in Appendix A). We
therefore followed the path taken by many other studies that preferred to collect and validate
their own databases, or to profoundly modify an existing one (e.g. [42, 60, 68, 73, 95, 98, 107]).


3.2 Database Criteria 

1. Duration. Determining the temporal window’s length depends on two main principles: 
(i) Avoid clips that elicit several distinct emotions at different times; (ii) Have the ability
to use many different and diverse clips in a single experiment, without exhausting the 
subjects. Therefore the clips must be relatively short, but still long enough to elicit a strong 
clear emotional response. 

2. VALR Rated. A consequence of our choice of the dimensional approach to describe emotion, 
alongside the scales of likability and rewatch. 

3. No Sensor Presence. The awareness of subjects of being observed or recorded (i.e. the
Hawthorne effect) was found to possibly alter their behavior [24, 66, 67].

4. Diversity. Clips should be taken from a variety of domains to reduce the effect of individual 
variability. Clearly the database should not contain only clips of cute cats in action or 
incredible soccer tricks, but a balanced mixture. 

5. Globally Germane. Clips must be intelligible regardless of their soundtrack and content 
(e.g. avoid regional jokes and tales). 

6. Unfamiliarity. Clips should be such that uninformed viewers are not likely to be familiar 
with them on one hand, while being publicly available on the other hand. 

7. Not Crowdsourced. Alongside its strengths, crowdsourcing can be problematic. For 
example, subjects are less attentive than subjects in a lab with an experimenter alongside 
them [78], and they differ in their psychological attributes (such as the level of self esteem) 
from other populations [32]. Moreover, due to our use of depth cameras, the experiment
must be held in a controlled environment (a depth camera is not a household item), and to 
keep correspondence between the experiment and the database ratings, it should also be 
tagged in the same environment. 

8. Publicly Available. To encourage ratification of results and competition, the data must 
be accessible. 
3.3 Overview

The database is composed of 36 short publicly available video clips (6-30 seconds, μ = 20 sec). Each
clip was rated by 26 participants on 5 scales, including valence, arousal, likability, rewatch (the 
desire to watch again) and familiarity, and in addition it was verbally described. Table 3.1 gives a 
summary of the database. 


Number of Clips          36
Duration (in seconds)    6-30 (μ = 20.0)
Number of Raters         26 (13 males and 13 females)
Available Scales         Valence [1-5]
                         Arousal [1-5]
                         Likability [1-3]
                         Rewatch [1-3]
                         Familiarity [0, 1-2, 3+]
                         Free Text [Hebrew]

Table 3.1: Emotion eliciting database summary.


3.4 Method 

A highly similar empirical framework was applied in both phases of data collection in this
work, database construction and model development; the main difference is that the depth camera
was present only in the second phase. Here we describe in detail only the clip selection process; the
methodology we adopted for the assessment phase is described in Chapter 4.


Clips Selection. Over 100 clips were initially selected from online video sources (such as
YouTube, Vimeo and Flickr), to be eventually reduced to 36. We attempted to achieve a diverse
set of unfamiliar clips, and therefore focused on lightly viewed ones. We excluded clips that
might offend participants by way of pornography or brutal violence¹. Several clips were manually
trimmed to remove irrelevant content, scaled to fit a 1440 x 900 resolution, and balanced to
achieve identical sound volume.


Clip Assessment. 26 volunteers with normal vision participated in the study, for which they 
received a small payment (13 males and 13 females between the ages of 19 and 29, μ = 23.5). For the 
method employed, see Chapter 4. 

¹ As a rule of thumb, we used videos that comply with YouTube’s Community Guidelines [4]. 
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3.5 Results and Analysis 

All clips were found to be significantly unfamiliar across all raters (p < .0001), and no influence 
of gender was found. Moreover, ratings on all scales showed high inter-rater agreement, with an 
average Intraclass Correlation Coefficient (ICC) of 0.945 (two-way mixed model, CI = .95). The results 
are illustrated in Figure 3.2, and detailed in Table 3.2. 



             Mean   Median   STD    min    max    ICC     α
Valence      3.04   3.21     1.00   1.23   4.42   0.975   0.973
Arousal      3.08   3.02     0.65   1.73   4.19   0.926   0.923
Likability   2.02   2.04     0.56   1.12   2.92   0.954   0.952
Rewatch      1.73   1.75     0.44   1.08   2.54   0.927   0.924

Table 3.2: Mean, median, standard deviation and range of the different scales over all clips, 
as well as Intraclass Correlation Coefficient (ICC) and Cronbach’s α. 
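
As an illustration of the agreement statistics reported in Table 3.2, the following is a minimal sketch of Cronbach's α using its standard textbook formula, treating each rater as an "item" and each clip as a "case". The function name and the layout of the ratings matrix (clips in rows, raters in columns) are assumptions made for this example and are not taken from the original analysis.

```python
import numpy as np

def cronbach_alpha(ratings):
    """Cronbach's alpha for a (n_clips, n_raters) matrix of ratings.

    alpha = k / (k - 1) * (1 - sum of per-rater variances / variance of row sums)
    The matrix layout is an assumption of this sketch.
    """
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                               # number of raters
    item_var = ratings.var(axis=0, ddof=1).sum()       # sum of per-rater variances
    total_var = ratings.sum(axis=1).var(ddof=1)        # variance of per-clip totals
    return k / (k - 1) * (1.0 - item_var / total_var)

# Hypothetical toy data: 3 clips rated by 2 raters.
toy = np.array([[1, 2],
                [2, 3],
                [3, 5]])
alpha = cronbach_alpha(toy)
```

Raters who agree perfectly (identical columns) yield α = 1, and disagreement lowers the value.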


There were strong correlations between several scales, most notably valence-likability 
(Pearson’s R = .92), valence-rewatch (R = .87) and likability-rewatch (R = .94). Interestingly, no 
significant correlation was found between arousal and likability (R = -.23) or between arousal 
and rewatch (R = -.04). A possible explanation is that some high-arousal clips are 
very pleasing (such as hilarious clips), while others are difficult to watch (such as car accident 
commercials), whereas high-valence clips are unlikely to displease anyone. As for 
valence-arousal, a small negative correlation was found (R = -.40), possibly because clips with 
extremely high V-A values, which mostly included pornographic content, were excluded; this 
result nevertheless corresponds to prior findings [34]. The results are shown in Figure 3.1. 





[Scatter-plot panels of the pairwise scales, including Arousal-Rewatch (ρ = -0.04)]

Figure 3.1: Correlations between valence, arousal, likability and rewatch 
Figure 3.2: Distribution of 4 subjective scores over all tested clips, where 
valence and arousal define the two main axes, also summarized in histogram 
form above and to the right of the plot. Size and color correspond 
respectively to the remaining 2 scores of rewatch and likability. 
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Method 


Data collection had two phases: (i) collecting and evaluating a suitable database of video clips 
which elicit strong and consistent emotional responses in their viewers, as described in 
Chapter 3; (ii) recording people’s spontaneous facial expressions when viewing these clips. 
As mentioned, a highly similar empirical framework was used in both phases. In Section 4.1 
we describe the experimental environment and workflow. In Section 4.2 we describe the 
collection of the raw recordings, which were later used to calculate the features of 
our models, as elaborated in Section 4.3. After assembling the features we learned our 
prediction models, a process that we describe in Section 4.4. 


4.1 Experimental Design 

Participants. Data collection was carried out in a single room, well lit with natural light, with 
minimal furnishing (2 tables, 2 chairs, a whiteboard, 2 computers and no pictures hanging on the 
walls). Participants were university students recruited using banners and posters. In each phase, 
26 volunteers with normal vision participated, for which they received a small payment: 13 males 
and 13 females between the ages of 19 and 29 (μ = 23.5) in phase 1, and 14 males and 12 females 
between the ages of 20 and 28 (μ = 23.3) in phase 2. 

Data Collection. Each data collection session consisted of the following stages (see Figure 
4.1): 

1. A fixation cross was presented for 5 seconds, and the participant was asked to stare at it. 

2. A video clip was presented. 

3. The participant verbally described his or her subjective emotion to the experimenter, using 
two sentences at most. 

4. The participant rated his or her subjective feelings on a pre-printed rating sheet (following [19]). 

Figure 4.1: The experimental design. 

Data collection was carried out via a self-written Matlab program, designed using the 
Psychophysics Toolbox extension [18, 50, 79]. Five discrete scales were used for rating: valence, 
arousal, likability, rewatch (the desire to watch again) and familiarity, alongside a free text 
description in Hebrew. Specifically, we used SAM Manikins [55] for valence and arousal, the "liking 
scale" for likability [52], and self-generated scales for rewatch and familiarity (see Figure 4.1). 
The SAM Manikins method was chosen because it is known to be parsimonious, inexpensive and 
quick, as well as comprehensive [17]. After every 4 trials, a visual search task (“Where’s Waldo?”) 
was presented in order to keep the participants focused, and to force them to change their 
head pose, sitting position and focal length. The order of the clips was randomized, and the entire 
procedure lasted about an hour. We encouraged participants to rate the clips according to their 
perceived, inner and subjective emotion. In addition, each participant completed the Big-Five 
Personality Traits questionnaire [46]. 


4.2 Facial Expressions Recording 

18 of the 36 clips were selected from the database. Aiming for a diverse corpus, we chose clips 
whose elicited response spanned the spectrum of VALR (valence, arousal, likability, rewatch) as 
uniformly as possible, also favoring clips with high inter-rater agreement. 

Each participant’s facial activity was recorded during the entire procedure, using a 3D 
structured light camera (Carmine 1.09). Participants were informed of being recorded and signed 
a consent form. 21 of the 26 participants reported after the experiment their belief that they were 
not affected by the recording, and that in fact they had forgotten that they were being recorded. 
Moreover, the subjective reports in the second phase on all 4 scales had a very similar 
distribution to the reports in the first phase: Valence (R = .98, p < .0001), Arousal (R = .90, 
p < .0001), Likability (R = .97, p < .0001) and Rewatch (R = .95, p < .0001). We therefore believe 
that the video affect tagging obtained in the database assemblage phase remains a reliable 
predictor of the emotional response elicited in the recorded experiment as well. 

4.3 Features 

Previous work in this field calculated facial features either by extracting raw movement of the 
face without relating to specific facial muscles (e.g., [103]), or by extracting the activity level of a 
single or a few muscles (e.g., [69, 101, 109]). In this work we extracted the intensity signals of 
over 60 AUs and facial gestures; this set was further analyzed manually to evaluate tracking 
accuracy and noise levels. Eventually 51 of this set of AUs were selected to represent each frame 
in the clip for further analysis and learning, including eyes, brows, lips, jaw and chin movements 
(see example in Fig. 4.3). 

Using this intensity level representation, which provides a time series of vectors in R^51, we 
computed higher order features representing the facial expression more concisely. This set of 
features can be divided into 4 types: Moments, Discrete States, Dynamic and Miscellaneous. 

Moments. The first 4 moments (mean, variance, skewness and kurtosis) were calculated for 
each AU in each facial video recording. 

Discrete States Features. For each AU separately, the raw intensity signal was quantized 
over time using K-Means ( K = 4), and the following four facial activity characteristic 
features were computed (see Figure 4.2 for an example): 

• Activation Ratio: Proportion of frames with any AU activation. 

• Activation Length: Mean number of frames for which there was continuous AU 
activation. 

• Activation Level: Mean intensity of AU activation. 

• Activation Average Volume: Mean activation level over all AUs, computed once 
for each expression. 

Dynamic Features. A transition matrix M was generated, measuring the number of transitions 
between the different levels described above, and three features were calculated for 
each AU based on it (see Figure 4.2 for an example): 

• Change Ratio: Proportion of transitions with level change. 

• Slow Change Ratio: Proportion of small changes (difference of 1 quantum). 

• Fast Change Ratio: Proportion of large changes (difference of 2 quanta or more). 

(A) (B) 

Figure 4.2: (A) Quantized AU signal (K = 4), and (B) its corresponding transition 
matrix. The number of frames labeled 0, 1, 2, 3 is 6, 3, 5, 5, respectively. Therefore: 
ActivationRatio = 13/19, ActivationLength = […] and ActivationLevel = 1.473, and 
ChangeRatio = […], SlowChangeRatio = […], FastChangeRatio = […]. 

Miscellaneous Features. These include the number of smiles and blinks in each facial response. 
The number of smiles was calculated by taking the maximum of the number of peaks in the 
signals of both lip corners, where a peak is defined as a local maximum which is higher by at 
least 0.75 than its surrounding points. The number of blinks was calculated in a 
similar manner, with a threshold of 0.2. 
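
The discrete state and dynamic features described above can be sketched as follows. This is a minimal illustration under several assumptions rather than the implementation used in this work: the quantizer is a plain one-dimensional Lloyd's k-means written in NumPy, "activation" is taken to mean any quantized level above 0, the activation level is the mean quantized level over all frames, and the change ratios are normalized by the total number of transitions. All function names are hypothetical.

```python
import numpy as np

def quantize_kmeans_1d(signal, k=4, n_iter=50):
    """Quantize a 1-D AU intensity signal into k ordered levels (0 = lowest).

    Plain Lloyd's k-means; initial centers are evenly spaced quantiles.
    """
    x = np.asarray(signal, dtype=float)
    centers = np.quantile(x, np.linspace(0.0, 1.0, k))
    for _ in range(n_iter):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        new_centers = centers.copy()
        for c in range(k):
            if np.any(labels == c):
                new_centers[c] = x[labels == c].mean()
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Relabel clusters so that level numbers follow the order of the centers.
    rank = np.empty(k, dtype=int)
    rank[np.argsort(centers)] = np.arange(k)
    labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
    return rank[labels]

def discrete_state_features(levels):
    """Activation ratio, mean activation run length and mean level."""
    levels = np.asarray(levels, dtype=int)
    active = levels > 0
    # Find maximal runs of consecutive active frames.
    edges = np.diff(np.concatenate(([0], active.astype(int), [0])))
    runs = np.flatnonzero(edges == -1) - np.flatnonzero(edges == 1)
    return {
        "activation_ratio": active.mean(),
        "activation_length": runs.mean() if runs.size else 0.0,
        "activation_level": levels.mean(),
    }

def transition_matrix(levels, k=4):
    """M[a, b] counts transitions from level a to level b between frames."""
    levels = np.asarray(levels, dtype=int)
    M = np.zeros((k, k), dtype=int)
    np.add.at(M, (levels[:-1], levels[1:]), 1)
    return M

def dynamic_features(levels, k=4):
    """Change, slow-change and fast-change ratios (needs at least 2 frames)."""
    M = transition_matrix(levels, k)
    total = M.sum()
    step = np.abs(np.subtract.outer(np.arange(k), np.arange(k)))
    return {
        "change_ratio": M[step >= 1].sum() / total,
        "slow_change_ratio": M[step == 1].sum() / total,
        "fast_change_ratio": M[step >= 2].sum() / total,
    }
```

Under these assumptions, any 19-frame sequence whose levels 0, 1, 2, 3 occur 6, 3, 5, 5 times has an activation ratio of 13/19 and an activation level of 28/19 ≈ 1.473, which agrees with the ActivationLevel value quoted in the caption of Figure 4.2.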


Highlight Period 

In most video clips, during most of the viewing time, participants’ facial activity showed almost 
no emotional response. Accordingly, considering the entire duration of the facial expression when 
calculating features is not only unnecessary, but could actually harm the model by adding noise 
to it. We therefore sought to find the time frame of each facial expression in which the relevant 
activity (in terms of affective response) was taking place. Specifically, we implemented a model 
that localizes the highlight period solely from the viewer’s facial expression, using a technique 
that is blind to the video clip. 

For each participant and clip, our model receives the participant’s muscular intensity levels for 
the clip’s duration (with margins of 6 seconds from its beginning and end), and isolates the 
activity of the gestures we found to be most informative (namely smiles, blinks, mouth dimples, 
lip stretches and mouth frowns). It then localizes the 6-second window in which these gestures 
achieve maximal average intensity and variance. With the exception of the moments, all features 
were computed based only on the highlight period. Note that for a given clip, the highlight period 
might differ between subjects (although its duration is always the same, which is a simplifying 
assumption of our model). 
Figure 4.3: Facial response (middle row) to a video clip (illustrated in the 
top row), and the time varying intensity of AU12 (bottom row). 

4.4 Predictive Models 

In this final step we learned two types of prediction models: implicit media tagging models 
(IMT) and affect prediction models (AP). Both model types predict an affective rank (in VALR 
terms) when given as input a facial expression recording, represented by a vector in feature 
space (d = 462). Since the number of participants in our study was only 26, a clear case of a small 
sample, the full vector representation, if used for model learning, would inevitably lead to 
overfitting and poor predictive power. We therefore started by significantly reducing the dimension 
of the initial representation of each facial activity using PCA. Our final method employed a 
two-step prediction algorithm (see illustration in Figure 4.4), as follows: 

First step. After the highlight period of each clip was detected (for each subject), it was divided 
into n fixed-size overlapping segments. A feature vector was calculated for each segment, 
and a linear regression model (f1) was trained to predict the 4-dimensional affective score 
vector of each segment. 

Second step. Two indicators (mean and std) of the set of predictions over all segments in the 
clip were calculated, and another linear regression model (f2) was trained to predict the 
4-dimensional affective scores from these indicators. 

Several parameters control the final representation of each facial expression clip, including 
the number of segments, the length of each segment, the percentage of overlap between the segments, 
the final PCA dimension, and whether PCA was done over each feature type or over all features 
combined. The values of these parameters were calibrated using cross-validation: given a training 
set of l points, the training and prediction process was repeated l times, each time using l-1 
points to train the model and predict the value of the left-out point. The set of parameters which 
achieved the best average results over these l cross-validation repetitions was used to construct 
the final facial expression representation of all data points. 

Figure 4.4: Illustration of the two-step prediction algorithm. 
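
To make the two-step scheme and its leave-one-out evaluation concrete, here is a minimal NumPy sketch. It is an illustration under assumptions, not the code used in this work: PCA and parameter calibration are omitted, each recording is assumed to be already represented as an array of per-segment feature vectors, every segment is assumed to inherit the score vector of its recording during the first step, and all function names are hypothetical.

```python
import numpy as np

def fit_linear(X, Y):
    """Least-squares linear map with intercept: Y ~ [X, 1] @ W."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    W, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return W

def predict_linear(W, X):
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ W

def summarize(segment_predictions):
    """Mean and std of the segment-level predictions of one recording."""
    return np.concatenate([segment_predictions.mean(axis=0),
                           segment_predictions.std(axis=0)])

def fit_two_step(recordings, targets):
    """recordings: list of (n_segments, d) arrays; targets: (n_recordings, 4)."""
    targets = np.asarray(targets, dtype=float)
    # Step 1: segment-level regression; each segment gets its recording's scores.
    X1 = np.vstack(recordings)
    Y1 = np.vstack([np.tile(t, (len(r), 1)) for r, t in zip(recordings, targets)])
    W1 = fit_linear(X1, Y1)
    # Step 2: regress the scores on the mean/std of the step-1 predictions.
    X2 = np.vstack([summarize(predict_linear(W1, r)) for r in recordings])
    W2 = fit_linear(X2, targets)
    return W1, W2

def predict_two_step(model, recording):
    W1, W2 = model
    summary = summarize(predict_linear(W1, recording))[None, :]
    return predict_linear(W2, summary)[0]

def leave_one_out(recordings, targets):
    """Train on all recordings but one, predict the one left out, repeat."""
    targets = np.asarray(targets, dtype=float)
    preds = np.zeros_like(targets)
    for i in range(len(recordings)):
        keep = [j for j in range(len(recordings)) if j != i]
        model = fit_two_step([recordings[j] for j in keep], targets[keep])
        preds[i] = predict_two_step(model, recordings[i])
    return preds
```

In a fuller version, the dimensionality reduction and the choice of segment count, length and overlap would be placed inside the same leave-one-out loop, so that the left-out point never influences the calibration.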

Different types of predictive models 

Altogether, we learned four different models that shared this mechanism, but varied in their prior 
knowledge and target (see Table 4.1 for formal definition). Specifically, the first two models were 
trained to predict the clip’s affective rating as stored in the database (implicit media tagging), 
while the last two models were trained to predict the viewer’s subjective affective state for each 
individual (affect prediction). 

Implicit media tagging of unseen clips (IMT-1). This model is built using the facial expres- 
sions of a single viewer. Given the facial response of this viewer to a new clip, the model 
predicts the clip’s affective rating. 

Implicit media tagging of unseen clips via multiple viewers (IMT-2). Given the facial 
response of a set of familiar viewers to a new clip, the model predicts the clip’s affective 
rating. Generalizing the first model, this second model predicts the new clip’s affective 
rating by taking into account the prediction of all viewers. 

Viewer’s affect prediction for an unseen clip (AP-1). This model is built using the facial 
expressions of a single viewer. Given the facial response of this viewer to a new clip, the 
model predicts the viewer’s subjective affective state. 

Affect prediction of new viewers (AP*). This model is built separately for each clip, using 
the facial expressions of all the viewers who had watched this clip. Given the facial response 
of a new viewer, the model predicts the new viewer’s subjective affective state when 
viewing this clip. 
Formally, denote the following: 

• Let $e_i^j$ denote the facial expression representation of viewer i to clip j. 

• Let $v_i^j$ (respectively: $a_i^j$, $l_i^j$, $r_i^j$) denote the valence (respectively: arousal, 
likability, rewatch) score of viewer i to clip j. 

• Let $s_i^j = (v_i^j, a_i^j, l_i^j, r_i^j)$ denote the affective vector score of viewer i to clip j. 

• Let $v^j$ (respectively: $a^j$, $l^j$, $r^j$) denote the valence (respectively: arousal, 
likability, rewatch) rating of clip j. 

• Let $c^j = (v^j, a^j, l^j, r^j)$ denote the affective vector rating of clip j. 


Model Name   Input                                          Model’s supervision                                Output
IMT-1        $e_m^k$                                        $\{e_m^j\}_{j=1, j \neq k}^{18}$                   $c^k$
IMT-2        $\{e_i^k\}_{i \in S},\ S \subseteq [26]$       $\{\{e_i^j\}_{j=1, j \neq k}^{18}\}_{i=1}^{26}$    $c^k$
AP-1         $e_m^k$                                        $\{e_m^j\}_{j=1, j \neq k}^{18}$                   $s_m^k$
AP*          $e_m^j$                                        $\{e_i^j\}_{i=1, i \neq m}^{26}$                   $s_m^j$

Comments: 

IMT-1: A unique model is built for each viewer m. Note that the model is utterly unfamiliar 
with clip k. 

IMT-2: The output is a single model, which predicts the clip’s rank ($c^k$) based on the average 
of the m models computed in IMT-1. 

AP-1: Similarly to IMT-1, a unique model is built for each viewer m. The model is also 
unfamiliar with clip k. 

AP*: The model predicts $s_m^j$ separately for each clip j, and is also utterly unfamiliar with 
viewer m. 

Table 4.1: Summary of models learned. 
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Results and Analysis 


To evaluate the predictive power of our models, we divided the set of recordings following a 
Leave-One-Out (LOO) procedure. Specifically, for IMT-1, IMT-2 and AP-1 we trained each 
model based on n-1 clips (n = 18) and tested our prediction on the clip which was left out. 
For AP* the procedure was identical, except that a viewer (instead of a clip) was left out, hence 
n = 26. The results are shown in Section 5.1, followed by an analysis of the relative importance 
of the different feature types (moments, discrete state features, dynamic features and 
miscellaneous features) in Section 5.2, and the localization of the highlight period in Section 5.3. 

5.1 Learning Performance 

Learning performance was evaluated by Pearson’s R between the actual VALR scores and the 
models’ predicted ones (see example in Figure 5.1). Table 5.1 shows the average VALR results 
over all clips/viewers (all correlations are significant, p < 0.0001). 



              Valence      Arousal      Likability   Rewatch
IMT-1         .752 (.14)   .728 (.07)   .637 (.22)   .661 (.15)
IMT-2         .948 (.22)   .874 (.22)   .951 (.17)   .953 (.19)
AP-1          .661 (.17)   .638 (.19)   .380 (.16)   .574 (.19)
AP*           .561 (.26)   .138 (.21)   .275 (.23)   .410 (.17)
Report/Tags   .783 (.08)   .461 (.33)   .659 (.13)   .561 (.23)

Table 5.1: Mean (and std) of Pearson’s R between the predicted 
and actual ranks, as well as the average correlation between 
the subjective report and media tags. 
Figure 5.1: Correlation example between predicted and actual ratings of a 
single viewer’s valence score (R = 0.791). 

Notably, the implicit media tagging models reached higher success rates than the affect prediction 
ones. In particular, although IMT-1 and AP-1 both predict an affective state when given a familiar 
viewer’s facial response to a new clip, the former yields noticeably higher correlations (mean 
difference over the four scales μ = .131). Based on these results, it could be suggested that it is 
easier to predict the expected affective state from a viewer’s facial expressions than the reported 
one. This is rather surprising, as the opposite sounds more likely: that one’s facial behavior 
would be a better predictor of one’s subjective emotion than of the expected one. Under the 
assumption that the participants in this research reported the emotion they truly experienced (as 
requested), it implies that human facial expressions are more faithful to the actual inner emotion 
than to the one reported. One reservation is that the viewer’s subjective report is given on a 
discrete scale, while the media is tagged on a continuous one (as these are average ratings), thus 
the correlation is likely to be higher for the latter. This problem is averted by comparing the 
results after binarizing the predictions. 

Binary prediction 

Apart from the aforementioned reason, what is often needed in real-life applications is a discrete 
binary prediction rather than a continuous grade, in order to indicate, for example, whether or not 
the viewer likes a video clip. To generate such a measure, we binarized the actual ratings and 
the predicted ones, using the corresponding mean of each measure as a threshold, and calculated 
the accuracy score (ACC) between them. Since the scores around the mean are rather ambiguous, 
we eliminated from further analysis the clips whose original tag was uncertain. This included 
clips with scores in the range μ ± σ, where μ denotes the average score over all clips and σ its 
std, thus eliminating 15% on average of all data points. The results can be found in Table 5.2. 
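
The binarization step can be sketched as follows. This is a minimal illustration, not the original evaluation code: the function name is hypothetical, and the width of the excluded band around the mean is exposed as a hypothetical parameter `band` (in units of the standard deviation) that the caller must choose.

```python
import numpy as np

def binary_accuracy(actual, predicted, band):
    """Accuracy of mean-thresholded predictions, ignoring ambiguous items.

    actual, predicted: 1-D arrays of continuous scores for one scale.
    band: half-width of the excluded zone around the mean of `actual`,
          in units of its standard deviation (an assumption of this sketch).
    Returns (accuracy on the kept items, fraction of items kept).
    """
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mu, sigma = actual.mean(), actual.std()
    keep = np.abs(actual - mu) > band * sigma      # drop scores near the mean
    actual_bin = actual > mu                       # threshold at own mean
    predicted_bin = predicted > predicted.mean()   # threshold at own mean
    accuracy = float(np.mean(actual_bin[keep] == predicted_bin[keep]))
    return accuracy, float(keep.mean())

# Hypothetical toy scores for five clips.
acc, kept = binary_accuracy([1, 2, 3, 4, 5], [1.2, 3.5, 3.0, 3.9, 4.8], band=0.5)
```

Each series is thresholded at its own mean, so a constant offset between predicted and actual scores does not affect the accuracy.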

Comparing the results of IMT-2 to published state-of-the-art methods (e.g. [69]) is encouraging, 
as an ability to tag media in VALR terms with accuracy rates of around 90% is unprecedented. 
That being said, it is important to note that these results were obtained using a newly composed 
database (presented in Chapter 3); further research must therefore be carried out for a balanced 
comparison between methods. 

         Valence     Arousal     Likability   Rewatch
IMT-1    71% (.15)   56% (.14)   68% (.12)    73% (.13)
IMT-2    94% (.20)   90% (.18)   91% (.21)    91% (.22)
AP-1     70% (.17)   53% (.22)   49% (.18)    49% (.27)
AP*      64% (.22)   50% (.17)   40% (.25)    42% (.26)

Table 5.2: Accuracy (and std) of the derived binary measure. 

Regardless of whether the scale is discrete, the ratio between the analogous algorithms 
(namely IMT-1 and AP-1) remains similar. Hence, these results support the aforementioned 
claim that human facial behavior is more faithful to the actual inner emotion than to the one 
reported. Furthermore, facial expressions allow for better predictions of media tags than the 
viewer’s subjective rating, as the success rates of IMT-1 are generally higher than the correlation 
between the viewer’s reported affective state and the media tags (see Report/Tags in Table 5.1). 
These findings are supported by theories of emotional self-report bias, which could arise from many 
factors (such as cultural stereotypes, the presence of an experimenter, and the notion of the 
reports being made public). 

Unfortunately, the binary predictions of AP* are generally no better than random predictions. 
Furthermore, for continuous ranks, the algorithm’s predictive power is limited: when a model 
is trained on a group of viewers (like AP*), it is preferable to predict a new viewer’s affective 
state using only the group’s affective reports, without relying on facial behavior at all, since 
the average affective rank of n - 1 viewers provides a more accurate prediction for the n-th viewer 
(the One-Viewer-Out method), as can be seen in Table 5.3. In other words, when given a group 
of viewers’ facial expressions and affective ranks, using the average of their ranks yields a 
better prediction of a new viewer’s rank than his or her facial expression does. 



                  Valence      Arousal      Likability   Rewatch
AP*               .561 (.26)   .138 (.21)   .275 (.23)   .410 (.17)
Average (n - 1)   .779 (.08)   .472 (.29)   .645 (.13)   .552 (.21)

Table 5.3: Mean (and std) of Pearson’s R between the predicted 
and actual ranks of AP* and the One-Viewer-Out method. 
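
The One-Viewer-Out baseline can be sketched in a few lines. This is a minimal illustration under assumptions: the function name and the layout of the ratings matrix (viewers in rows, clips in columns, one scale at a time) are hypothetical, and each viewer's score is Pearson's R between his or her own ratings and the mean ratings of all other viewers.

```python
import numpy as np

def one_viewer_out_correlations(ratings):
    """Per-viewer Pearson's R between own ratings and the others' mean.

    ratings: array of shape (n_viewers, n_clips) for a single scale.
    """
    ratings = np.asarray(ratings, dtype=float)
    n = ratings.shape[0]
    total = ratings.sum(axis=0)
    rs = np.empty(n)
    for i in range(n):
        others_mean = (total - ratings[i]) / (n - 1)   # leave viewer i out
        rs[i] = np.corrcoef(ratings[i], others_mean)[0, 1]
    return rs

# Hypothetical toy ratings: 3 viewers, 4 clips.
toy_ratings = np.array([[1, 2, 3, 4],
                        [2, 3, 4, 5],
                        [1, 3, 2, 4]])
rs = one_viewer_out_correlations(toy_ratings)
```

Averaging `rs` over viewers gives a figure comparable in form to the "Average (n - 1)" row of Table 5.3.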


In addition, although AP-1 seemingly yields rather accurate predictions, deeper inspection 
reveals that an alternative mechanism based on IMT-1 could be suggested, namely returning the 
predictions of IMT-1 instead; this mechanism yields more accurate predictions of the subjective 
affect rank than AP-1. In other words, if the media’s tags are also available, it is preferable to 
train a model to predict these tags and rely on this model alone, rather than on a model trained to 
predict the viewer’s subjective rank. Moreover, if other viewers’ data is also available, relying on 
it (namely returning IMT-2’s predictions) is better still. We demonstrate this observation 
by inspecting the mean error (ME) between the actual subjective ranks and the predictions given 
by the AP-1, IMT-1 and IMT-2 models; see Table 5.4. 



         Valence   Arousal   Likability   Rewatch
AP-1     .265      .370      .388         .371
IMT-1    .214      .262      .325         .341
IMT-2    .179      .239      .295         .306

Table 5.4: Mean error between all viewers’ subjective ranks and the predictions of AP-1, 
IMT-1 and IMT-2. 


5.2 Relative Importance of Features 

We analyzed the relative importance of the different facial features for IMT-1, observing that 
different facial features contributed more or less, depending on the affective scale being predicted. 
Features’ relative importance was calculated by learning the models as described above, while 
using only a single type of features at each time, and comparing the predictions’ success rates. 
For example, we observed that for IMT-1 the prediction of valence relied on all 4 feature types 
(including moments, discrete state features, dynamic features and miscellaneous features), while 
the prediction of arousal didn’t use the miscellaneous features at all, but relied heavily on 
the dynamic aspects of the facial expression. Similarly, the prediction of likability utilized the 
miscellaneous features the most, while not using the moments features. Specifically, prediction 
with only miscellaneous features achieved correlation of R = 0.275 with the likability score, and 
R = 0.287 with the rewatch score. These observations are summarized in Figure 5.2. 

As a comparison, we analyzed the relative importance of the different facial features for the 
analogous affect prediction algorithm (namely AP-1). As can be seen in Figure 5.3, the distribution 
is similar in both models, except that the dynamic features’ relative importance is consistently 
higher in affect prediction (μ = +11.75%). Because the dynamic features are relatively more 
dominant in predicting subjective emotion, this observation encourages us to hypothesize that 
the temporal aspects of viewers’ facial behavior, in addition to serving as a good predictor for 
emotions in general, could be used as a distinguishing property between different viewers. This 
idea is in line with [100], where it was shown that these aspects help to distinguish between 
schizophrenia patients and healthy individuals. 
Figure 5.2: The relative contribution of different feature groups to IMT-1. 
Panels: (a) Valence, (b) Arousal, (c) Likability. 

Figure 5.3: The relative contribution of different feature groups to AP-1. 
Panels: (a) Valence, (b) Arousal, (c) Likability. 

5.3 Localization of Highlight Period 

We also analyzed the relative location of the response highlight period (HP) within the clip. 
Although this period was computed bottom-up from the facial recording of each individual 
viewer and without access to the observed video clip, the correspondence between subjects was 
notably high (ICC = .941). Not surprisingly, the beginning of the period was usually found a few 
seconds before the clip’s end (μ = -7.22, σ = 4.14), and in some clips it lasted after the clip ended 
(specifically in 8 out of the 18 clips). Yet the HP localization clearly depended, in a reliable manner 
across viewers, on the viewed clip. For example, when viewing a car safety clip, the average HP 
started 14 seconds before its end, probably because a highly unpleasant violent sequence of car 
crashes had begun a second earlier. We may conclude that the HP tends to fall around the clip’s 
end most of the time, but clip-specific analysis is preferable in order to locate it more precisely. 
The distribution of HPs is presented in Figure 5.4. 
Figure 5.4: Histogram of HPs relative to the clips’ end time, which marks the 
origin of the x-axis (μ = -7.22, σ = 4.14, χ² = 0.86). 
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Discussion 


Our contribution in this work is two-fold. First, we obtained a database of video clips which 
give rise to a strong, predictable emotional response, as verified by an empirical study, and 
which is available to the community. Second, and more importantly, we described several 
algorithms that can predict emotional response based on spontaneous facial expressions recorded 
by a depth camera. Our method provided a fairly accurate prediction for 4 scores of affective 
state: Valence, Arousal, Likability and Rewatch (the desire to watch again). We achieved high 
correlation between the predicted scores and the affective tags assigned to the video. In addition, 
our results suggest that a group of viewers performs better as a predictor for media tagging than 
a single viewer (similarly to the idea of "The Wisdom of Crowds"), as IMT-2 achieves evidently 
higher success rates than IMT-1. Hence, in real-life systems, it is preferable to rely on the facial 
behavior of a group of known viewers rather than on a single one, if possible. 

When using facial expressions for automatic affect prediction, we saw that it is easier to predict 
the expected affective state (i.e. implicit media tagging) than the viewer’s reported affective state 
(i.e. affect prediction). Further analysis showed that a prediction based on the media tags provides 
a more accurate estimation of the viewers’ affective state than a prediction based on the viewer’s 
report. These results are rather surprising, and we believe that further effort to improve the 
affect prediction models could obtain better results. One possible course of action to expand 
the predictive power is to utilize viewers’ personality properties by using the Big-5 Personality 
Traits [46], which were also collected in this study. Another approach is to consider head and 
upper-body gestures, which could also be captured by depth cameras, as it has been shown that 
body movements contribute to emotion prediction [11]. 

Interestingly, when computing the period of strongest response in the viewing recordings, we 
saw high agreement between the different viewers. Further analysis revealed that different types 
of facial features are useful for the prediction of different scores of emotional state. For example, 
we saw that simply counting the viewer’s smiles and blinks (miscellaneous features) provided 
an inferior, yet significantly correlated, prediction of Likability and Rewatch. For commercial 
applications, these facial features can be obtained from a laptop’s embedded camera (in a 
similar manner to [69]). Furthermore, we found that the dynamic aspects of facial expressions 
contribute more to the prediction of viewers’ affective state than to the prediction of media tags. 

In a wider perspective, we recall that in April 2005, an 18-second video clip titled "Me at 
the zoo" became the first video clip uploaded to YouTube. In the decade since, the world has 
witnessed an unprecedented growth of online video resources: on YouTube alone there are over 
1.2 billion video clips; adding other popular websites like Vimeo, Dailymotion and Facebook, 
we reach an almost unfathomable number of cuddling cats, soccer tricks and recorded DIY manuals. 
In April 2016 the CEO of Facebook, Mark Zuckerberg, stated that within 5 years Facebook will be 
almost entirely composed of videos; and with 1.8 billion active users that watch over 8 billion 
videos per day [3], that is a statement that should be taken seriously. 

Such a staggering number of videos poses many challenges to computer scientists and 
engineers. Since every user can upload videos as he or she desires, classifying and mapping them 
for accurate and quick retrieval, ease of use, search engines and recommendation systems becomes 
a challenging task. One solution is tagging the videos, that is, assigning descriptive labels that 
aid in indexing and arranging them. Another challenge is to understand the viewers’ expected 
emotional response, to refine personal customization and to comprehend the effect of the videos 
on the users. These are the challenges we tackled in this work. 
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Review of Emotion Elicit Databases 


The following pages include a review of all major emotion eliciting databases. Note that the 
databases as reviewed here may cover only part of what was released, as only the 
emotion eliciting clips are discussed. For example, both DEAP [52] and MAHNOB-HCI 
[91] contain major parts of subjects’ physiological signals that are not mentioned in this review. 
Drawbacks are numbered with respect to Section 3.2. 
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